date()
## [1] "Thu Dec 09 16:17:32 2021"
Well, I had never coded in my life before last January, so I am still a bit of a “noob” at this type of task. Regardless, I am a fast learner. I have been able to design some robust pipelines for data analysis, but I have maintained them locally. It is very exciting for me to start learning to use GitHub and all of its applications. The description of the course got me really excited since I am always copy/pasting code and referencing the source, and I think the abilities shown in the course will enable me to be more agile in my coding.
Here is my forked repository:
https://github.com/GonzalezVictorM/IODS-project
date()
## [1] "Thu Dec 09 16:17:32 2021"
Here we go again…
Remember! Set your working directory and load your libraries.
knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")
library(tidyverse)
library(ggplot2)
library(GGally)
library(hrbrthemes)
library(janitor)
The data we will be working on comes from an international survey of Approaches to Learning, made possible by Teachers’ Academy funding for KV in 2013-2015.
The data describes the subjects’ attitude and their surface, deep, and strategic learning approaches, along with their gender, age, and the points they obtained. The data has been pre-processed for analysis by consolidating 60 variables into the 7 of interest, and by filtering with Points > 0, which reduced the 183 observations to the 166 considered valuable.
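As a rough sketch of that filtering step (using a hypothetical three-row stand-in for the raw data, not the real wrangling script):

```r
library(dplyr)

# Hypothetical miniature of the raw data: one subject skipped the exam (points = 0)
raw <- data.frame(gender = c("F", "M", "F"),
                  age    = c(25, 30, 22),
                  points = c(22, 0, 19))

# Keep only subjects who attended the exam
kept <- raw %>% filter(points > 0)

nrow(kept)  # 2 rows remain
```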
First, we read the data and explore the dimensions, structure and head.
learning2014 <- read.table("learning2014.csv", header = T, sep = ",")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
As we can observe, the data contains 166 observations of 7 variables, as expected. The columns included are gender, age, attitude, deep, strategic, surface, and points. Looking at the head, we confirm that the organization of the table matches the dimensions and structure.
Once we understand the structure of our data, we can move to the analysis. The goal is to understand the effect that the different learning variables have over the points in the examination. This can be aided by observing if gender or age have an effect over the results.
Our first step is to do Exploratory Data Analysis using scatter plots, box plots, and histograms, as well as descriptive statistics such as correlation coefficients, means, and variances. Thankfully, ggpairs consolidates most of that information in a single figure with the following code.
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
But, what should we focus on?
We can start by observing the age and gender distributions that are described in the first and second columns. From there, we can conclude that:
Additionally, I personally prefer to look at violin plots instead of box plots, since violin plots also show the density of the data.
# Narrow the data for Learning type scores of Female subjects
female_lt <- learning2014 %>%
filter(gender == "F") %>%
.[,c(3:6)] %>%
gather(key="LearningType", value="Score") %>%
mutate(gender = "F")
# Narrow the data for Learning type scores of Male subjects
male_lt <- learning2014 %>%
filter(gender == "M") %>%
.[,c(3:6)] %>%
gather(key="LearningType", value="Score") %>%
mutate(gender = "M")
# Bind female and male data
both_lt <- rbind(female_lt,male_lt)
# Print the data
both_lt %>%
mutate(LearningStyle = paste0(LearningType, "\n", gender)) %>%
ggplot(aes(x = LearningStyle, y = Score, fill = gender)) +
geom_violin(width = 1, alpha = 0.5) +
geom_boxplot(width = 0.1, alpha = 0.2)
This plot helps us confirm that there are no uneven densities across the genders and learning types, supporting the normality of each variable. Still, we can see outliers in the lower tails of attitude (M), deep (F), and deep (M). We will keep this in mind when we move to the diagnostic plots.
Lastly, we can look at the mean and standard deviation of the scores separated by gender in the form of a table. In the following table, you can notice the similarities between the genders.
learning2014 %>%
group_by(gender) %>%
summarise(n=n(),
attitudeMean = mean(attitude),
attitudeSD = sd(attitude),
deepMean = mean(deep),
deepSD = sd(deep),
StraMean = mean(stra),
StraSD = sd(stra),
surfMean = mean(surf),
surfSD = sd(surf),
pointsMean = mean(points),
pointsSD = sd(points)) %>% # The summarise function gives you the statistics that you want.
t %>% as.data.frame %>%
row_to_names(row_number = 1)
## F M
## n 110 56
## attitudeMean 2.990000 3.442857
## attitudeSD 0.7251637 0.6463484
## deepMean 3.656818 3.724702
## deepSD 0.5349331 0.5924439
## StraMean 3.201136 2.964286
## StraSD 0.7477212 0.8008214
## surfMean 2.829545 2.703869
## surfSD 0.4580220 0.6423441
## pointsMean 22.32727 23.48214
## pointsSD 5.826402 6.006031
The box plot, the violin plot, and the table have the same goal: to provide descriptive statistics that let us identify unusual features in our data, such as outliers or a non-normal distribution of the results. The latter is particularly important since some modelling strategies assume normality in the data.
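If a formal complement to these visual checks is wanted, a Shapiro-Wilk test of normality is one option; a minimal sketch on simulated data (not part of the original analysis):

```r
set.seed(1)
x <- rnorm(100)        # simulated sample drawn from a normal distribution

# H0: the sample comes from a normal distribution;
# a large p-value gives no evidence against normality
sw <- shapiro.test(x)
sw$p.value
```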
Once we have observed the effects of gender and age, we can continue to evaluate the correlation between the learning types and the examination points. This can be done from the figure above or by plotting each scatter plot with its linear model separately. I always like to double-click into each correlation, so here is the code for that.
# Attitude vs points
cor <- signif(cor(learning2014$attitude,learning2014$points),2) #calculate the correlation between the two variables
ggplot(learning2014, aes(x = attitude, y = points, col = gender)) + # Plot the variables stratified by gender
geom_point() +
geom_smooth(method ="lm", formula = 'y ~ x') + # Include the linear model
ggtitle(paste("Comparison of attitude and points by gender. Correlation = ", cor)) # Create a title including the correlation coefficient
# Deep vs points
cor <- signif(cor(learning2014$deep,learning2014$points),2)
ggplot(learning2014, aes(x = deep, y = points, col = gender)) +
geom_point() +
geom_smooth(method ="lm", formula = 'y ~ x') +
ggtitle(paste("Comparison of deep and points by gender. Correlation = ", cor))
# Strategic vs points
cor <- signif(cor(learning2014$stra,learning2014$points),2)
ggplot(learning2014, aes(x = stra, y = points, col = gender)) +
geom_point() +
geom_smooth(method ="lm", formula = 'y ~ x') +
ggtitle(paste("Comparison of strategic and points by gender. Correlation = ", cor))
# Surface vs points
cor <- signif(cor(learning2014$surf,learning2014$points),2)
ggplot(learning2014, aes(x = surf, y = points, col = gender)) +
geom_point() +
geom_smooth(method ="lm", formula = 'y ~ x') +
ggtitle(paste("Comparison of surf and points by gender. Correlation = ", cor))
From these plots we can conclude:
We start by including the three variables of interest in the multiple linear model and looking at the output of the summary() function.
my_model <- lm(points ~ attitude + stra + surf, data = learning2014) # Create the model
summary(my_model) # Print the summary of results
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The first thing that jumps out is the t statistic and its associated p-values. Both the intercept and the attitude coefficient are significant (<0.05), while the rest are not. The p-value of the t statistic tells us the probability of observing a coefficient at least this large by chance (due to standard error) if the true coefficient were zero. When this probability is under our significance level, usually 0.05, we consider that the coefficient of the explanatory variable significantly explains variation in the target variable. Therefore, we tentatively conclude that at least attitude has a significant relationship with points.
Then, we observe the F-statistic and its associated p-value. The F-statistic tests the regression as a whole, which avoids the multiple-testing problems of individual p-values (which I think will be covered later in the course). Therefore, we also expect it to be significant (<0.05), which it is! This means that at least one of the explanatory variables makes this model explain the variation in the target variable better than the null model (all coefficients equal to zero).
Lastly, we look at the multiple R-squared. This tells us how much of the variation in our target variable is explained by the model instead of the error. We can see that 20.7% of the variation is explained by our explanatory variables. This is low, but we still have work to do.
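The statistics quoted above can also be extracted programmatically instead of read off the printout; a sketch using the built-in mtcars data, since the pattern is identical for any lm fit:

```r
# Stand-in fit on built-in data; for the real analysis, use my_model instead
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$r.squared                   # multiple R-squared
s$adj.r.squared               # adjusted R-squared
s$coefficients[, "Pr(>|t|)"]  # p-values of the t statistics
s$fstatistic                  # F statistic and its degrees of freedom
```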
When working with a multiple linear regression, you want to remove/add one explanatory variable at a time. Here we will remove the variable with the highest t-statistic p-value, since that is the explanatory variable that least explains the variation in our target variable. The chosen one is … surface.
We will now run the model again.
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
We can first see that all the p-values associated with the t statistics radically decreased. Still, strategic remains above the typical threshold of 0.05. This means there is more than a 5% probability that a coefficient this large could arise by chance. We might have to remove it, but let us look at the other statistical descriptors first.
The F-statistic increased and its p-value decreased. This tells us that this model is better at explaining the variation of our target variable than the last one.
Lastly, the adjusted R-squared slightly increased (from 0.193 to 0.195), even though the multiple R-squared slightly decreased, as it always does when a variable is dropped; the model still explains about 20.5% of the variation in the target variable.
Even with the improved results, I want to test what happens if we remove the explanatory variable with the highest t-statistic p-value, to evaluate whether the model works better without it.
my_model3 <- lm(points ~ attitude, data = learning2014)
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The p-values associated with the t and F statistics did not change substantially (less than one order of magnitude), and the multiple R-squared got worse. Because of this, I will keep the second model as our best chance of explaining the data.
In our selected model, model 2 (points ~ attitude + strategic), the coefficient for attitude is 3.4658 and the coefficient for strategic is 0.9137. This means that every 1-unit increase in attitude corresponds to a ~3.5-unit increase in predicted points, and every 1-unit increase in strategic corresponds to a ~0.9-unit increase, holding the other variable constant.
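This “per unit” reading can be verified directly with predict(): two observations that differ by exactly one unit in a single variable differ in their fitted values by exactly that variable’s coefficient. A sketch on the built-in mtcars data (the same check applies to my_model2):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars, identical except for a 1-unit difference in wt
newdata <- data.frame(wt = c(3, 4), hp = c(100, 100))
p <- predict(fit, newdata)

unname(diff(p))          # change in fitted value
unname(coef(fit)["wt"])  # the wt coefficient: the same number
```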
To study our selected model, we will generate diagnostic plots. These plots visually represent the relationships of the residuals (differences between observed and fitted values) in the model, and will help us validate the assumptions made by the model.
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))
On the first plot we have the Residuals vs Fitted values. This plot relates the residuals (errors) to the fitted values of the target variable. As we can observe, the variability of the residuals appears to remain constant regardless of the size of the fitted values, and there is no clear pattern. This supports the assumption that the size of the errors does not depend on the explanatory variables.
The second plot is the Normal Q-Q plot. This plot aids in checking for symmetry and normality of the error term in the regression. We see that most of the values fall on the identity line, but both the lower and upper tails of the data are skewed. This makes us doubt the normality assumption of the errors in the model.
Lastly, we have the Residuals vs Leverage plot. The leverage plot helps us observe the influence of specific data points on the model. We can see that no single point has an outsized impact on the model.
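The quantities behind these panels can also be pulled out by hand, which makes explicit what each one plots; a sketch on built-in data (the same calls work on my_model2):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

r <- resid(fit)       # residuals: observed minus fitted
f <- fitted(fit)      # fitted values (x-axis of the Residuals vs Fitted panel)
h <- hatvalues(fit)   # leverage of each observation (x-axis of the last panel)

plot(f, r)            # Residuals vs Fitted, drawn manually
qqnorm(r); qqline(r)  # Normal Q-Q plot of the residuals
```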
date()
## [1] "Thu Dec 09 16:17:44 2021"
Here we go again…
Remember! Set your working directory and load your libraries.
knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")
library(tidyverse)
library(ggplot2)
library(GGally)
library(ggpubr)
library(boot)
We read the data.
pormath <- read.csv("pormath.csv", header = T, sep = ",")
The data comes from P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. The data explores the attributes of student lives and their effect on their academic performance in math and Portuguese classes. For more information on the data set, visit this link.
For this analysis, we will be focusing on the alcohol use of the students and the effect that various attributes might have on it.
We first explore the dimensions and structure.
dim(pormath)
## [1] 370 50
str(pormath)
## 'data.frame': 370 obs. of 50 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ id.p : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ id.m : int 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ failures.m: int 0 0 3 0 0 0 0 0 0 0 ...
## $ paid.p : chr "no" "no" "no" "no" ...
## $ paid.m : chr "no" "no" "yes" "yes" ...
## $ absences.p: int 4 2 6 0 0 6 0 2 0 0 ...
## $ absences.m: int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1.p : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G1.m : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2.p : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G2.m : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3.p : int 11 11 12 14 13 13 13 13 17 13 ...
## $ G3.m : int 6 6 10 15 10 15 11 6 19 15 ...
## $ failures : num 0 0 1.5 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "no" "no" ...
## $ absences : num 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num 2.5 7 9.5 14.5 8.5 13.5 12.5 8 15.5 13 ...
## $ G2 : num 8.5 8 10.5 14 11.5 13.5 12 9 17 13.5 ...
## $ G3 : num 8.5 8.5 11 14.5 11.5 14 12 9.5 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
Since we are exploring alcohol use, we will select the variables high_use and alc_use. The variable high_use marks students whose weekly alcohol consumption is above 2 (numeric: from 1 - very low to 5 - very high), while alc_use holds the actual value (numeric: from 1 - very low to 5 - very high). Once we have our target variables, we select the explanatory variables that we wish to study. I am interested in finding out whether age, health, or the family situation affects alcohol consumption. Therefore, I will select age (numeric: from 15 to 22), health (numeric: from 1 - very bad to 5 - very good), famrel (numeric: from 1 - very bad to 5 - excellent), and famsup (binary: yes or no). Additionally, it is always interesting to look at the influence of gender, so I will also include sex (binary: ‘F’ - female or ‘M’ - male).
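For reference, the two target variables come from averaging the weekday (Dalc) and weekend (Walc) consumption columns; a sketch on a hypothetical three-row frame (the real derivation lives in the separate wrangling script):

```r
library(dplyr)

# Hypothetical stand-in for the weekday/weekend consumption columns
toy <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 5, 2))

toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average weekly consumption
         high_use = alc_use > 2)        # TRUE above the cutoff of 2

toy$high_use  # FALSE TRUE FALSE
```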
My hypotheses are:
The three formats for the data shown below will allow us to create different types of graphs as it will be seen in the rest of the report.
## Select the columns as they are
pormath_subset <- pormath %>%
select(sex, high_use, alc_use, age, health, famrel, famsup)
## Turn sex, health, famrel, and famsup into factors
pormath_subset_fct <- pormath %>%
select(sex, high_use, alc_use, age, health, famrel, famsup) %>%
mutate(across(c(sex, health, famrel, famsup), factor))
## Turn health, famrel, and famsup into numeric values
pormath_subset_nmr <- pormath_subset_fct %>%
mutate(across(c(health, famrel, famsup), as.numeric))
As a first look, we can explore our variables numerically. We can do this both as numeric variables:
pormath_subset_nmr %>%
summary
## sex high_use alc_use age health
## F:195 Mode :logical Min. :1.000 Min. :15.00 Min. :1.000
## M:175 FALSE:259 1st Qu.:1.000 1st Qu.:16.00 1st Qu.:3.000
## TRUE :111 Median :1.500 Median :17.00 Median :4.000
## Mean :1.889 Mean :16.58 Mean :3.562
## 3rd Qu.:2.500 3rd Qu.:17.00 3rd Qu.:5.000
## Max. :5.000 Max. :22.00 Max. :5.000
## famrel famsup
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:1.000
## Median :4.000 Median :2.000
## Mean :3.935 Mean :1.624
## 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :5.000 Max. :2.000
or as factors and their frequency:
pormath_subset_fct %>%
summary
## sex high_use alc_use age health famrel
## F:195 Mode :logical Min. :1.000 Min. :15.00 1: 46 1: 8
## M:175 FALSE:259 1st Qu.:1.000 1st Qu.:16.00 2: 42 2: 18
## TRUE :111 Median :1.500 Median :17.00 3: 80 3: 64
## Mean :1.889 Mean :16.58 4: 62 4:180
## 3rd Qu.:2.500 3rd Qu.:17.00 5:140 5:100
## Max. :5.000 Max. :22.00
## famsup
## no :139
## yes:231
##
##
##
##
We can observe the following:
Now we look at the graphical representation separated by high_use:
pormath_subset %>%
select(-sex) %>% # remove sex
ggpairs(mapping = aes(col = high_use, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
and sex:
pormath_subset %>%
select(-high_use) %>% # remove high_use
ggpairs(mapping = aes(col = sex, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
Immediately we can notice that:
We can now look at the means and standard deviations of the data by cross-tabulation:
pormath_subset_nmr %>%
group_by(high_use, sex) %>%
summarise(
AvgAge = mean(age),
sdAge = sd(age),
AvgHealth = mean(health),
sdHealth= sd(health),
AvgFamRel = mean(famrel),
sdFamRel = sd(famrel),
AvgFamSup = mean(famsup),
sdFamSup = sd(famsup))
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 10
## # Groups: high_use [2]
## high_use sex AvgAge sdAge AvgHealth sdHealth AvgFamRel sdFamRel AvgFamSup
## <lgl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE F 16.6 1.09 3.37 1.40 3.92 0.926 1.69
## 2 FALSE M 16.3 1.24 3.67 1.41 4.13 0.833 1.55
## 3 TRUE F 16.5 1.12 3.39 1.55 3.73 0.837 1.73
## 4 TRUE M 16.9 1.23 3.93 1.28 3.79 0.991 1.51
## # ... with 1 more variable: sdFamSup <dbl>
But this is hard to read, so let us show it graphically. The following graph shows the average value of each variable with its standard error bars.
pormath_subset_nmr %>%
pivot_longer(cols = alc_use:famsup, names_to = "Attribute") %>%
group_by(high_use, sex, Attribute) %>%
summarise(Avg = mean(value),
se = sd(value)/sqrt(length(value))) %>%
filter(Attribute != "alc_use") %>%
ggplot(aes(x = sex, y = Avg, fill = high_use)) +
geom_bar(position = position_dodge(),
stat = "identity",
alpha = 0.6) +
geom_errorbar(aes(x = sex, ymin = Avg-se, ymax = Avg+se),
position = position_dodge(width = 0.9),
width = 0.6,
alpha = 0.6)+
facet_wrap("Attribute", scales = "free") +
theme_classic()
## `summarise()` has grouped output by 'high_use', 'sex'. You can override using the `.groups` argument.
As expected, for age and famrel the difference in means between the high_use groups is larger than the standard error, but mostly among male students. Health also shows some difference in means, but only among male students and smaller than the standard error.
Now we look at box plots of each variable to confirm the data distribution.
First, age:
g1 <- pormath_subset_nmr %>%
ggplot(aes(x = high_use, y = age, fill = sex)) +
geom_boxplot(alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
g2 <- pormath_subset_nmr %>%
ggplot(aes(x = alc_use, y = age, group = alc_use, fill = sex)) +
geom_boxplot(alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
ggarrange(g1, g2, ncol = 1)
Among male students particularly, there seems to be an increase in alc_use as age increases.
Then, health:
g1 <- pormath_subset_nmr %>%
ggplot(aes(x = high_use, y = health, fill = sex)) +
geom_boxplot(alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
g2 <- pormath_subset_nmr %>%
ggplot(aes(x = alc_use, y = health, group = alc_use, fill = sex)) +
geom_boxplot(alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
ggarrange(g1, g2, ncol = 1)
No clear pattern is visible.
Now, famrel:
g1 <- pormath_subset_nmr %>%
ggplot(aes(x = high_use, y = famrel, fill = sex)) +
geom_boxplot(alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
g2 <- pormath_subset_nmr %>%
ggplot(aes(x = alc_use, y = famrel, group = alc_use, fill = sex)) +
geom_boxplot(alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
ggarrange(g1, g2, ncol = 1)
It looks like students with high_use tend to have lower famrel.
famsup is a bit trickier, since it is a factor variable. Therefore, we will use frequency and mean plots instead of box plots.
g1 <- pormath_subset_fct %>%
group_by(sex, famsup, high_use) %>%
summarise(count = n()) %>%
ggplot(aes(x = famsup, y = count, fill = high_use )) +
geom_bar(position = "dodge",stat = "identity",alpha = 0.4) +
facet_wrap("sex") +
theme_classic()
## `summarise()` has grouped output by 'sex', 'famsup'. You can override using the `.groups` argument.
g2 <- pormath_subset_fct %>%
group_by(sex, famsup) %>%
summarise( se = sd(alc_use)/sqrt(length(alc_use)),
alc_use = mean(alc_use)) %>%
ggplot(aes(x = famsup, y = alc_use, fill = sex)) +
geom_bar(position = "dodge",stat = "identity",alpha = 0.4) +
geom_errorbar(aes(x = famsup, ymin = alc_use-se, ymax = alc_use+se),
position = position_dodge(width = 0.9),
width = 0.6,
alpha = 0.6) +
facet_wrap("sex") +
theme_classic()
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
ggarrange(g1, g2, ncol = 1)
alc_use does not seem to change significantly based on famsup.
Lastly, I wanted to look at their average academic performance, which is depicted in the G3 final grade (numeric: from 0 to 20, output target):
g1 <- pormath %>%
ggplot(aes(x = high_use, y = G3, fill = sex)) +
geom_violin(width = 1, alpha = 0.5) +
geom_boxplot(position = position_dodge(width =1),
width = 0.1,
alpha = 0.2) +
facet_wrap("sex") +
theme_classic()
g2 <- pormath %>%
ggplot(aes(x = alc_use, y = G3, group = alc_use, fill = sex)) +
geom_violin( width = 0.5, alpha = 0.5) +
geom_boxplot( width = 0.1, alpha = 0.2) +
facet_wrap("sex") +
theme_classic()
ggarrange(g1, g2, ncol = 1)
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Groups with fewer than two data points have been dropped.
While there is no significant change in the means, note how the variation in the final grade shrinks as alc_use and high_use increase. In other words, the performance of students who consume less alcohol has a wider distribution.
But let us focus on building our model. We start by including the four variables of interest in the logistic regression model and looking at the output of the summary() function.
model1 <- glm(high_use ~ age + health + famrel + famsup,
data = pormath_subset,
family = "binomial")
summary(model1)
##
## Call:
## glm(formula = high_use ~ age + health + famrel + famsup, family = "binomial",
## data = pormath_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4952 -0.8754 -0.7331 1.3372 2.0207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.65006 1.75593 -2.079 0.0376 *
## age 0.21535 0.09932 2.168 0.0301 *
## health 0.16502 0.08600 1.919 0.0550 .
## famrel -0.32958 0.12617 -2.612 0.0090 **
## famsupyes -0.14944 0.23926 -0.625 0.5322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 437.98 on 365 degrees of freedom
## AIC: 447.98
##
## Number of Fisher Scoring iterations: 4
famsup is a categorical variable, so its tests are performed a bit differently. Here, a Wald test checks whether the difference between the coefficient of the reference level (no) and the level (yes) is different from zero. Therefore, we know that there is no significant difference between these levels. To check whether the variable as a whole is significant, we can first remove the intercept.
model2 <- glm(high_use ~ age + health + famrel + famsup - 1,
data = pormath_subset,
family = "binomial")
summary(model2)
##
## Call:
## glm(formula = high_use ~ age + health + famrel + famsup - 1,
## family = "binomial", data = pormath_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4952 -0.8754 -0.7331 1.3372 2.0207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## age 0.21535 0.09932 2.168 0.0301 *
## health 0.16502 0.08600 1.919 0.0550 .
## famrel -0.32958 0.12617 -2.612 0.0090 **
## famsupno -3.65006 1.75593 -2.079 0.0376 *
## famsupyes -3.79950 1.72571 -2.202 0.0277 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 512.93 on 370 degrees of freedom
## Residual deviance: 437.98 on 365 degrees of freedom
## AIC: 447.98
##
## Number of Fisher Scoring iterations: 4
By removing the intercept, we are now testing if each coefficient of the categorical variable is different from zero. We see that the individual coefficients are now considered significant.
To resolve this contradiction, we can create a third model without the variable famsup and compare it to our first model using a likelihood ratio test. This tests whether all the famsup coefficients can be set to zero without significantly worsening the fit.
model3 <- glm(high_use ~ age + health + famrel,
data = pormath_subset,
family = "binomial")
summary(model3)
##
## Call:
## glm(formula = high_use ~ age + health + famrel, family = "binomial",
## data = pormath_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4559 -0.8625 -0.7425 1.3298 2.0631
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.86577 1.72440 -2.242 0.02497 *
## age 0.22295 0.09865 2.260 0.02382 *
## health 0.16310 0.08593 1.898 0.05770 .
## famrel -0.32859 0.12608 -2.606 0.00916 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 438.37 on 366 degrees of freedom
## AIC: 446.37
##
## Number of Fisher Scoring iterations: 4
anova(model1, model3, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ age + health + famrel + famsup
## Model 2: high_use ~ age + health + famrel
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 437.98
## 2 366 438.37 -1 -0.3887 0.533
The high Pr(>Chi) tells us that the variable should not remain in the model.
Now we can also see that health is not significant, even though it is quite close. Therefore, we will try removing it.
model4 <- glm(high_use ~ age + famrel,
data = pormath_subset,
family = "binomial")
summary(model4)
##
## Call:
## glm(formula = high_use ~ age + famrel, family = "binomial", data = pormath_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3086 -0.8605 -0.7384 1.3540 1.8530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.14909 1.66508 -1.891 0.0586 .
## age 0.20841 0.09761 2.135 0.0328 *
## famrel -0.29917 0.12361 -2.420 0.0155 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 442.08 on 367 degrees of freedom
## AIC: 448.08
##
## Number of Fisher Scoring iterations: 4
This caused the intercept to lose significance, but the remaining variables are marked as significant. Regardless, after running the predictive and cross-validation analysis, model 3 outperformed model 4. Therefore, I consider model 3 to be the best available model with these variables.
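A lighter-weight way to compare such nested candidates, short of a full cross-validation, is AIC (a sketch on built-in data; the same call works on model3 and model4):

```r
# Stand-in logistic models on built-in data (am: transmission type, 0/1)
m_a <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
m_b <- glm(am ~ wt,      data = mtcars, family = "binomial")

AIC(m_a, m_b)  # the lower AIC marks the better fit-vs-complexity trade-off
```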
So regarding my hypothesis:
So let’s look at the odds ratios with their confidence intervals (CIs).
model <- model3
OR <- model %>%
coef %>%
exp
CI <- model %>%
confint %>%
exp
## Waiting for profiling to be done...
m_OR <- cbind(OR, CI)
m_OR
## OR 2.5 % 97.5 %
## (Intercept) 0.02094671 0.0006882756 0.6060803
## age 1.24976103 1.0307101817 1.5191667
## health 1.17715618 0.9972312442 1.3977846
## famrel 0.71994050 0.5607948332 0.9210363
What we can conclude from these coefficients is that:
A 2X2 cross tabulation can give us a better idea of the predictive power of the model:
probs <- predict(model, type = "response") # Predict probability responses based on the variables
pormath_predict <- pormath_subset %>%
mutate(probability = probs) %>% # add them to our data.frame
mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = pormath_predict$high_use,
prediction = pormath_predict$prediction) %>%
prop.table %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68648649 0.01351351 0.70000000
## TRUE 0.28648649 0.01351351 0.30000000
## Sum 0.97297297 0.02702703 1.00000000
Around 70% of the values were predicted correctly. We can also see a graphical depiction:
pormath_predict %>%
ggplot(aes(x = probability, y = high_use, col = prediction)) +
geom_point() +
theme_classic()
We can also compute the total proportion of inaccurately classified individuals (the training error) using the following function:
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
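As a quick sanity check of the loss function on hypothetical values (class labels are 0/1, so a probability on the wrong side of 0.5 counts as a misclassification):

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# two of these four hypothetical predictions land on the wrong side of 0.5
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.6, 0.1))  # 0.5
```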
We can apply it to the data and compare its performance against a 10-fold cross-validation.
te <- loss_func(class = pormath_predict$high_use, prob = pormath_predict$probability)
set.seed(1)
cv <- pormath_predict %>%
cv.glm(cost = loss_func, glmfit = model, K = 10)
cat(paste("Training error: ", te, "\n10-fold CV: ", cv$delta[1]))
## Training error: 0.3
## 10-fold CV: 0.308108108108108
Using the variables I selected, I could not make the model have a better performance than that of the DataCamp exercise.
date()
## [1] "Thu Dec 09 16:18:03 2021"
Here we go again…
Remember! Set your working directory and load your libraries.
knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")
library(tidyverse)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)
library(car)
We read the data.
data("Boston")
“Housing Values in Suburbs of Boston” is a data set native to the MASS package in R. The data set includes various indicators for the different Boston suburbs. Further explanation on the variables can be found here.
We first explore the dimensions and structure.
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The data set contains mostly numeric variables with the exception of rad (index of accessibility to radial highways) and chas (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)). They are a ranked factor and a logical variable respectively. Therefore we convert them and look at the results.
Note: because rad has a large number of factor levels, we will keep it as an integer.
Boston <- Boston %>%
mutate(chas = as.logical(chas))
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Mode :logical
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 FALSE:471
## Median : 0.25651 Median : 0.00 Median : 9.69 TRUE :35
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The table above provides a lot of information on the quartile distribution of our variables, but I would rather look at the variables with box plots or bar plots:
Boston %>%
pivot_longer(cols = c(1:14), names_to = "Attribute", values_to = "Value") %>%
filter(!Attribute %in% c("chas","rad")) %>%
ggplot(aes(x = Attribute, y = Value)) +
geom_boxplot() +
facet_wrap("Attribute", scales = "free") +
theme_classic()
Boston %>%
pivot_longer(cols = c(1:14), names_to = "Attribute", values_to = "Value") %>%
filter(Attribute %in% c("chas","rad")) %>%
ggplot(aes(x = Value)) +
geom_histogram() +
facet_wrap("Attribute", scales = "free") +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can conclude:
Additionally, we can look at the normality of the non-factorial distributions using the qqPlot.
colnames(Boston)[-c(4,9)] %>% sapply(function(x){
qqPlot(Boston[,x], main = x)
})
## crim zn indus nox rm age dis tax ptratio black lstat medv
## [1,] 381 58 489 143 366 42 354 489 197 451 375 162
## [2,] 419 200 490 144 365 75 352 490 198 424 415 163
The paired statistics and scatter plots of the variables are also useful to start looking at the data.
Boston %>%
ggpairs(mapping = aes(alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 1.5)))
From here we can see that:
Another way of looking at these values is a correlation plot (corrplot or ggcorr):
data("Boston")
cor_mat <- cor(Boston) %>% round(digits = 2)
corrplot(cor_mat, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
ggcorr(Boston, geom = "circle", method = c("everything", "pearson"))
These show the same information as the ggpairs plot.
Due to the different dimensions of each variable, it is hard to compute the influence of each one and compare them reliably. Therefore, we present a summary of the scaled variables.
boston_scaled <- Boston %>%
scale %>%
as.data.frame
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Now we can compare between them: crim presents the biggest range of scaled values, and we can start analyzing the scaled effects of each variable on the crime rate.
First we separate the crime variable in quartiles and assign the low, med-low, med-high, and high labels to each quartile.
bins <- quantile(boston_scaled$crim) # binning in quartiles
crime <- boston_scaled$crim %>%
cut(breaks = bins,
include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
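To make the binning step concrete, here is the same quantile() + cut() pattern on a small hypothetical vector:

```r
x <- 1:8
bins <- quantile(x)  # 0%, 25%, 50%, 75%, 100% -> 1, 2.75, 4.5, 6.25, 8
cut(x, breaks = bins, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
# each label gets two observations:
# low low med_low med_low med_high med_high high high
```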
We now substitute the numeric crime variable for the binned quartiles.
boston_qt <- boston_scaled %>%
mutate(crim = crime)
Once we have the factor variable, it is time to generate our training and testing data sets. Our training data set will be used to generate an explanatory model for our target variable crime. The model will be applied to the test data to predict the outcome. Note that the sampling is random, so each iteration of this testing might give slightly different models.
set.seed(123) # to maintain random variables
n <- nrow(boston_qt)
ind <- sample(n, size = n * 0.8)
train <- boston_qt[ind,]
test <- boston_qt[-ind,]
Save and remove the crime levels from the testing data.
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
And now we are ready to do our linear discriminant analysis (LDA). LDA is a classification method used for the modelling of categorical variables. Let’s remember that this model assumes normality of the variables and that is why we do the scaling first. Additionally, variables should be continuous, but for this exercise we can ignore chas.
lda.fit <- lda(crim ~ ., data = train)
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2500000 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 1.01866313 -0.9066422 -0.08120770 -0.8835724 0.38666334 -0.9213895
## med_low -0.07141687 -0.3429155 0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591 0.2162741 0.19544522 0.3637157 0.12480815 0.4564260
## high -0.48724019 1.0149946 -0.03371693 1.0481437 -0.47733231 0.8274496
## dis rad tax ptratio black lstat
## low 0.9094324 -0.6819317 -0.7458486 -0.4234280 0.3729083 -0.766883775
## med_low 0.3694282 -0.5395408 -0.4205644 -0.1079710 0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959 0.1049654 0.009720124
## high -0.8601246 1.6596029 1.5294129 0.8057784 -0.6383964 0.900379309
## medv
## low 0.47009410
## med_low 0.01548761
## med_high 0.19440747
## high -0.71294711
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.148390645 0.74870532 -1.0874785
## indus 0.040971465 -0.38126652 -0.1619456
## chas 0.002460776 0.03963849 0.1699312
## nox 0.312458150 -0.67564471 -1.3104018
## rm 0.011086947 -0.16058718 -0.1572603
## age 0.283482486 -0.38817624 -0.1013491
## dis -0.079848789 -0.38493775 0.2108038
## rad 3.718978412 0.93123532 -0.4706522
## tax -0.015197127 0.04230505 1.2889075
## ptratio 0.180294774 -0.01592588 -0.3558490
## black -0.136724112 0.02948075 0.1288959
## lstat 0.145739238 -0.37823065 0.3345688
## medv 0.061327205 -0.53906503 -0.1509890
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0364 0.0113
The prior probabilities tell us the probability of falling into each of the groups. The coefficients of linear discriminants give us, for each variable, the coefficient of each linear discriminant that maximizes the ratio of the between-group variance to the within-group variance. Lastly, the proportion of trace is the amount of the between-group variance that is explained by each corresponding linear discriminant (e.g. LD1 explains ~95% of the between-group variance).
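For reference, the proportion of trace can be recovered from the singular values stored in the fitted lda object; a self-contained sketch on R's built-in iris data (not our Boston training set):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
# squared singular values give the between-group variance captured by each LD
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)  # LD1 dominates, matching lda's printed "Proportion of trace"
```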
The following code takes the coefficients and draws arrows on the plot below, representing the size of each scaled variable's effect on the model.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
We change the crime variable in our training data set to a numeric vector.
classes <- as.numeric(train$crim)
We plot the LDA fitted model (LD1 vs LD2) and separate each class using colors and figures. We also run the function to draw the arrows.
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
We can observe that rad has the biggest influence on the separation of the model, and its influence is mostly explained by LD1. This means that changes in the rad variable are the most likely to push an observation into the high crime-rate class.
Now we can test the predictive power of this model over the test data set and create a table to compare the results with those stored in correct_classes.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 10 4 0
## med_low 2 17 6 0
## med_high 1 9 13 2
## high 0 0 1 27
We can now apply the same loss function we used before to assess the proportion of wrong predictions.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(as.numeric(correct_classes),as.numeric(lda.pred$class))
## [1] 0.3431373
We can see that about 34% of the predictions were incorrect.
For the next part of the analysis, we will observe different ways of comparing the similarity of the towns. We will use both Euclidean and Manhattan distances. Euclidean distance is simply the geometric distance between two points, while Manhattan distance sums the absolute differences between the coordinates of two points.
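A tiny worked example of the two metrics on hypothetical 2-D points:

```r
p <- c(0, 0)
q <- c(3, 4)
sqrt(sum((p - q)^2))  # Euclidean: 5
sum(abs(p - q))       # Manhattan: 3 + 4 = 7
# dist() computes the same pairwise values over a matrix of observations
m <- rbind(p, q)
dist(m)                        # 5
dist(m, method = "manhattan")  # 7
```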
First we measure the euclidean distance:
data("Boston")
boston_scaled <- Boston %>%
scale %>%
as.data.frame
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
and the Manhattan distance as well:
dist_man <- dist(boston_scaled, method = 'manhattan')
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
These two distances are used by clustering protocols, which work by minimizing the distance between observations.
One example of such clustering protocols is k-means clustering. Note that, contrary to the LDA classification method, clustering is an unsupervised method that does not have the final classification determined a priori. Also remember that the distance measure will change the output of your clustering.
So now we run the k-means clustering method on the scaled data starting by three centroids.
km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled, col = km$cluster)
In the plot we can observe how the distribution into the clusters separates the data in the pairwise scatter plots. But this was just a guessed number of centroids. We can optimize this by using the within cluster sum of squares (WCSS).
We will run the k-means clustering method changing the number of clusters from 1 to 10. We will take the WCSS (i.e. the variance between the observations in a cluster and the center of that cluster) and compare it as the number of centers increases.
So first we run all the different k in 1:10 and store the total WCSS (TWCSS):
set.seed(123) # to maintain the result of random variables.
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
Now we plot the TWCSS’s vs the number of centers.
qplot(x = 1:k_max, y = twcss, geom = 'line')
We can see that the biggest change occurs between 1 and 2 centers. We will try now the clustering with two centers.
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
As we observe the scatter plots, we can appreciate that the groups are better defined than when using 3 centers. Therefore, 2 centers is the optimal k for our clustering analysis.
For the bonus points: LDA on k-means
data("Boston")
boston_scaled <- Boston %>%
scale %>%
as.data.frame
set.seed(123) # to maintain random variables
km <- kmeans(boston_scaled, centers = 3)
boston_km <- boston_scaled %>%
mutate(crim = as.factor(km$cluster))
n <- nrow(boston_km)
ind <- sample(n, size = n * 0.8)
train <- boston_km[ind,]
test <- boston_km[-ind,]
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
lda.fit <- lda(crim ~ ., data = train)
classes <- as.numeric(train$crim)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
We can observe that rad remains the biggest influence on the separation of the model, and its influence is again mostly explained by LD1. This means that changes in the rad variable are the most likely to assign an observation to cluster 1. age seems to be the next most influential variable, mostly separating clusters 2 and 3 along the LD2 axis.
date()
## [1] "Thu Dec 09 16:18:37 2021"
Here we go again…
Remember! Set your working directory and load your libraries.
knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")
library(tidyverse)
library(car)
library(GGally)
library(FactoMineR)
We read the data.
human <- read.csv("human_base.csv", header = T, sep = ",")
The Human Development Index (HDI) is a summary measure of a country's average achievement in human development, such as a long and healthy life with a decent standard of living. The HDI is the geometric mean of normalized indices that represent human development separated into three categories. The Gender Inequality Index (GII) reflects gender-based disadvantage based on reproductive health, empowerment and the labor market. GII is shown together with HDI to show the difference in human development between genders due to inequality between female and male achievements. The following data sets consolidate the HDI and GII indicators (17) for 195 countries. For more information on the topic, visit the following links:
http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf
http://hdr.undp.org/en/content/human-development-index-hdi
First we re-format GNI to contain numbers instead of characters.
human$GNI <- human$GNI %>%
gsub(",","", x = .) %>%
as.numeric
We select the variables of interest for our study and remove those countries that are missing any of the indexes (i.e. NAs).
keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp",
"GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human <- human %>%
dplyr::select(one_of(keep))
human_ <- human %>%
filter(complete.cases(human))
Looking at the tail of the table, we can notice that the last seven observations are regions instead of countries. Therefore, we remove them as we change the country names into the row names.
tail(human_, 10)
## Country Edu2.FM Labo.FM Edu.Exp Life.Exp GNI
## 153 Chad 0.1717172 0.8080808 7.4 51.6 2085
## 154 Central African Republic 0.3782772 0.8531140 7.2 50.7 581
## 155 Niger 0.3076923 0.4459309 5.4 61.4 908
## 156 Arab States 0.7289916 0.3081009 12.0 70.6 15722
## 157 East Asia and the Pacific 0.8250377 0.7884131 12.7 74.0 11449
## 158 Europe and Central Asia 0.8784119 0.6514286 13.6 72.3 12791
## 159 Latin America and the Caribbean 0.9836957 0.6729323 14.0 75.0 14242
## 160 South Asia 0.5329670 0.3711083 11.2 68.4 5605
## 161 Sub-Saharan Africa 0.7015873 0.8537859 9.6 58.5 3363
## 162 World 0.8333333 0.6558018 12.2 71.5 14301
## Mat.Mor Ado.Birth Parli.F
## 153 980 152.0 14.9
## 154 880 98.3 12.5
## 155 630 204.8 13.3
## 156 155 45.4 14.0
## 157 72 21.2 18.7
## 158 28 30.8 19.0
## 159 85 68.3 27.0
## 160 183 38.7 17.5
## 161 506 109.7 22.5
## 162 210 47.4 21.8
last <- nrow(human_) - 7
human_ <- human_[1:last, ]
rownames(human_) <- human_$Country
human <- human_[,-1]
Now we can take a look at the data. With 155 countries remaining, we will be studying the interactions of 8 continuous variables of int or num type.
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : num 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
As in the last chapter, it is hard to compare the ranges of the variables or their means and standard deviations without first scaling, so let's look at them graphically first.
human %>% rownames_to_column(var = "Country") %>%
pivot_longer(cols = c(-1), names_to = "Attribute", values_to = "Value") %>%
ggplot(aes(x = Attribute, y = Value)) +
geom_boxplot() +
facet_wrap("Attribute", scales = "free", nrow = 2) +
theme_classic()
Most of the variables seem to show normal distributions according to the box plots: they mostly lack outliers and the median is close to the center of the range. Maternal mortality and GNI do not show these characteristics. First of all, GNI has a range at least 3 orders of magnitude bigger than any other variable. Also, both of these variables have three quartiles concentrated in the lower part of their range, with outliers that go much higher.
Let us use the qqplots to confirm the normality of these distributions.
colnames(human) %>% sapply(function(x){
qqPlot(human[,x], main = x)
})
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## [1,] 153 116 155 126 30 150 155 136
## [2,] 141 141 2 133 44 153 148 104
As expected from the box plots, GNI and maternal mortality diverge from the identity line in our qqplots making their normality questionable.
To continue our analysis, we will look at the correlation coefficients. For correlation comparison, scaling is not considered necessary. Therefore, let us see the situation when the values are not scaled.
human %>%
ggpairs(mapping = aes(alpha = 0.3))
human %>%
ggcorr(method = c("everything", "pearson"))
From here we can see:
We could keep discussing the correlation between variables, but with this many variables the model becomes complex enough that we might reach unclear conclusions. Therefore, we will use a dimensionality reduction technique. Principal component analysis (PCA) lets us focus on the main dimensions that explain most of the variation in the data points. PCA starts by defining a dimension along which the variance between the points is maximized. This is our first principal component, since it accounts for most of the variation. From then on, the algorithm keeps looking for dimensions that maximize the remaining variance. The catch is that, from the second principal component onward, each dimension must be orthogonal to the previous ones. For further details on PCA watch this video which I love.
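A minimal, self-contained sketch of the idea on simulated data (two strongly correlated hypothetical variables, so PC1 should absorb nearly all of the variance):

```r
set.seed(42)  # hypothetical toy data
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1))
p <- prcomp(toy, scale. = TRUE)
# proportion of variance per component; PC1 captures almost everything
round(summary(p)$importance["Proportion of Variance", ], 3)
```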
So let's go at it. Note that I have not scaled the values yet.
pca_human <- prcomp(human)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
From the summary table you can notice that PC1 accounts for 99.99% of the variance. With this result, one might think that this single dimension explains essentially all of the variance, so hurray!
Wait, there is a catch. Let us look at a graphical representation of the first 2 PCs in a biplot, together with the size of each variable's influence (represented by pink arrows showing the projection of each variable's axis onto the 2-D plane spanned by PC1 and PC2).
pca_pr <- round(100*s$importance[2,], digits = 2)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Just as expected, the lack of scaling is causing problems. As discussed, GNI is over 3 orders of magnitude bigger than any other variable, and thus any variation in that variable will dominate the PCA. That is why the projection of the GNI dimension is so prominent: PC1 is almost parallel to it.
Since PCA works by maximizing variance, we have to scale the variables for the analysis to be worthwhile. So here we go with the scaled data.
human_std <- scale(human)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
Much better, now we can see how the distribution of the variation is spread between the different PCs. But let us look at PC1 and PC2 since they account for almost 70% of the variation.
pca_pr <- round(100*s$importance[2,], digits = 2)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
Now we can see that:
Since this is not my area of expertise, I would rather not overinterpret, but there does seem to be a clear separation between developed and developing countries in terms of human development.
For data sets with categorical data, the MCA is a good alternative to the PCA. Here we call for the tea data set from the Factominer package.
data(tea)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
The data used here comes from a questionnaire about tea consumption: 300 individuals were interviewed about how they drink tea (18 questions), their perception of the product (12 questions), and some personal details (4 questions).
Since these are all categorical variables, let us just look at the bar plots to see the distributions.
tea %>%
dplyr::select(-age) %>%
pivot_longer(cols = c(-sex), names_to = "Attribute", values_to = "Value") %>%
ggplot(aes(Value, fill = sex)) +
facet_wrap("Attribute", scales = "free") +
geom_bar(position = "dodge",alpha = 0.4) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
We can see that:
Again, all of these variables can seem very overwhelming, but MCA can simplify things. Before that, let us select the variables describing how the respondents drink their tea.
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea <- dplyr::select(tea, one_of(keep_columns))
MCA uses frequencies instead of geometrical distances to find patterns between categorical variables. It works on multi-dimensional cross-tabulations of the data; there are various ways to encode these, such as an indicator matrix or the Burt matrix. The main idea is to split each categorical variable into its categories and align them with the observations.
Using the inter- and intra-category variances, the algorithm then generates new dimensions that explain as much variance as possible. Let us try it.
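As an aside, the indicator- and Burt-matrix construction can be sketched in base R on a toy data frame (the data and variable names below are hypothetical, not the tea data):

```r
# A toy sketch of the indicator (disjunctive) matrix and Burt matrix behind
# MCA, using base R only. This data frame is hypothetical, not the tea data.
toy <- data.frame(
Tea = factor(c("black", "green", "black")),
How = factor(c("milk", "alone", "alone"))
)

# One dummy column per category, one row per observation. The contrasts.arg
# trick keeps every level (no reference level is dropped):
indicator <- model.matrix(
~ . - 1, data = toy,
contrasts.arg = lapply(toy, contrasts, contrasts = FALSE)
)

# Each row sums to the number of variables, because every observation picks
# exactly one category per variable:
rowSums(indicator)

# The Burt matrix cross-tabulates every category with every other category:
burt <- t(indicator) %*% indicator
```

The diagonal of the Burt matrix holds the category counts; the off-diagonal blocks are the pairwise cross-tabulations MCA decomposes.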
mca <- tea %>%
MCA(graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = ., graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
We can conclude that:
It is much easier to interpret by looking at the graphical depiction of this analysis.
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
Note that this is the categories plot. It tells us which categories are closely related to each other. We can therefore conclude:
With this exercise we could explore how the categories are distributed. Remember that the first two dimensions only account for about 29% of the variance. Therefore, always try to explore at least some of the next dimensions (this also applies to PCA).
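As a sanity check, the percentages of variance reported by MCA can be recomputed from the eigenvalues copied out of the summary output above:

```r
# Eigenvalues copied from the MCA summary output above (Dim.1 to Dim.11):
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
0.144, 0.141, 0.117, 0.087, 0.062)

pct <- 100 * eig / sum(eig)  # % of variance per dimension
cum <- cumsum(pct)           # cumulative % of variance

round(cum[2], 1)  # first two dimensions: 29.5
```

This matches the 29.471% cumulative figure in the summary (up to rounding of the copied eigenvalues), and confirms most of the variance lives beyond the two plotted dimensions.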
date()
## [1] "Thu Dec 09 16:18:54 2021"
Here we go again…
Remember! Set your working directory and load your libraries.
knitr::opts_knit$set(root.dir ="C:/Users/ramosgon/OneDrive - University of Helsinki/Courses/OpenDataScience/IODS-project/Data")
library(ggpubr)
library(tidyverse)
library(ggplot2)
library(lme4)
library(broom)
The following data is taken from Crowder and Hand (1990). The effect of three diets, one per group, on the weight of rats was studied over nine weeks. A measurement was made at time point 0 and then weekly, except for week 7, when an extra measurement was taken.
First, we read the data.
rats <- read.csv("rats_long.csv") %>%
mutate(across(c(ID, Group), factor))
And observe its structure and summarized values.
str(rats)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : int 1 8 15 22 29 36 43 44 50 57 ...
## $ Weight: int 240 250 255 260 262 258 266 266 265 272 ...
summary(rats)
## ID Group Time Weight
## 1 : 11 1:88 Min. : 1.00 Min. :225.0
## 2 : 11 2:44 1st Qu.:15.00 1st Qu.:267.0
## 3 : 11 3:44 Median :36.00 Median :344.5
## 4 : 11 Mean :33.55 Mean :384.5
## 5 : 11 3rd Qu.:50.00 3rd Qu.:511.2
## 6 : 11 Max. :64.00 Max. :628.0
## (Other):110
We can see we have the long version of the data, which has previously been wrangled. An ID factor with 16 levels tells us there were a total of 16 subjects, each weighed 11 times. We can also see that the group 1 diet was given to twice as many rats as each of the other two diets. Time seems to be expressed in days instead of weeks. Lastly, the weights of the rats span a range of about 400 grams. Since these measurements have been repeated on the same subjects over an extended period of time, we can label this as longitudinal data.
A great way to start looking at longitudinal data is to plot the response directly in a line plot, in this case weight vs. time. We will also colour the plotted lines by the group each rat belongs to.
rats %>%
ggplot(aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(col = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
ggtitle("Rat's weight over time") +
theme_bw()
We can observe that rats with a higher weight at the beginning of the study remain heavier throughout the study, and the same can be said about those with a lower weight. Let us try to observe this tracking phenomenon in more detail by standardizing the weight of the rats within each measurement time.
rats <- rats %>%
group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
rats %>%
ggplot(aes(x = Time, y = stdweight, group = ID)) +
geom_line(aes(col = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Standardized weight") +
ggtitle("Rat's standardized weight over time") +
theme_bw()
Though it still seems a bit cryptic, the plot shows a flatter pattern in the weight measurements, confirming that the tracking trend is there. This is still a complex way to look at the data, though. When working with a large number of subjects, it is better to look at the average profile of each group of interest. Let us take the average by time point and group and plot it with its standard error (note that the se calculation uses n() to account for the unequal number of subjects in each group).
ratsTS <- rats %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
glimpse(ratsTS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, ~
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2~
## $ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.87~
ratsTS %>%
ggplot(aes(x = Time, y = mean, col = Group, shape = Group)) +
geom_line() +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width=0.5) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)") +
theme_bw()
From here we can conclude:
If the measurements are equally spaced in time, we can easily look at a boxplot for each time point. Remember that this is not the case for this data (week 7), but let us take a look anyway.
rats %>%
mutate(across(c(Time), factor)) %>%
ggplot(aes(x = Time, y = Weight, fill = Group)) +
geom_boxplot(alpha = 0.3) +
scale_y_continuous(name = "Weight (grams)") +
theme_bw()
Here we can observe that all three groups seem to have outliers, but let us consider this more carefully once we have performed a summary measure analysis.
Since we want to find out whether the rate of change in weight is the same for every diet, we choose the regression coefficient as the summary measure. We first fit a linear regression of weight on time for each rat and extract its slope; the resulting data is then summarized in a box plot.
rats_RegS <- rats %>%
group_by(Group, ID) %>%
do(regTimeCoeff = tidy(lm(Weight ~ Time, data = .))$estimate[2]) %>%
mutate(regTimeCoeff = unlist(regTimeCoeff))
glimpse(rats_RegS)
## Rows: 16
## Columns: 3
## Rowwise:
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ regTimeCoeff <dbl> 0.4844071, 0.3303560, 0.3980345, 0.3295261, 0.4058091, 0.~
rats_RegS %>%
ggplot(aes(x = Group, y = regTimeCoeff)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=2, fill = "red") +
scale_y_continuous(name = "Time estimate coefficient in Lm(weight ~ time)") +
theme_bw()
We can conclude:
To avoid bias, let us try this again, removing the outliers and storing their IDs in a variable.
rats_RegS1 <- rats_RegS %>%
filter(regTimeCoeff>0.25)
outlierIDs <- rats_RegS$ID[rats_RegS$regTimeCoeff < 0.25]
rats_RegS1 %>%
ggplot(aes(x = Group, y = regTimeCoeff)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=2, fill = "red") +
scale_y_continuous(name = "Time estimate coefficient in Lm(weight ~ time)") +
theme_bw()
With the outliers removed, it is harder to assert whether there is a significant difference between groups 1 and 3.
We still need formal confirmation of whether the coefficient differs between group 1 and groups 2 or 3. We can use a t-test between pairs of groups, but we then risk a multiple testing error. Let us look at the values regardless.
groups <- c(3,2,1)
stats <- sapply(groups, function(g){
test <- t.test(regTimeCoeff ~ Group, data = filter(rats_RegS, Group != g), var.equal = TRUE)
return(c(test$p.val, unlist(test$conf.int)))
})
colnames(stats) <- c("G 1 and 2:", "G 1 and 3:", "G 2 and 3:")
rownames(stats) <- c("pval", "conf_int_low", "conf_int_high")
t(stats)
## pval conf_int_low conf_int_high
## G 1 and 2: 0.002236939 -0.9372091 -0.27446917
## G 1 and 3: 0.022063830 -0.5439406 -0.05273443
## G 2 and 3: 0.282396430 -0.3297647 0.94476800
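If one did want to guard against the multiple-testing problem, base R's p.adjust() can be applied to the three p-values (copied from the output above); Holm's correction is used here purely as an illustration:

```r
# Pairwise t-test p-values copied from the table above:
pvals <- c("G 1 and 2" = 0.002236939,
"G 1 and 3" = 0.022063830,
"G 2 and 3" = 0.282396430)

# Holm-adjusted p-values; the first two comparisons stay below 0.05:
p.adjust(pvals, method = "holm")
```

So even after correction, the group 1 vs 2 and group 1 vs 3 differences remain nominally significant.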
We can conclude:
Note that multiple testing can inflate error rates, as mentioned in Chapter 2. Therefore, the linear regression model provides a more accurate description of the relationships. Let us fit a model and look at the results with the summary() and anova() functions.
fit <- lm(regTimeCoeff ~ Group, data = rats_RegS)
summary(fit)
##
## Call:
## lm(formula = regTimeCoeff ~ Group, data = rats_RegS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60295 -0.07034 0.04179 0.13911 0.37562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35964 0.09114 3.946 0.00167 **
## Group2 0.60584 0.15786 3.838 0.00205 **
## Group3 0.29834 0.15786 1.890 0.08127 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2578 on 13 degrees of freedom
## Multiple R-squared: 0.5382, Adjusted R-squared: 0.4671
## F-statistic: 7.574 on 2 and 13 DF, p-value: 0.006594
anova(fit)
## Analysis of Variance Table
##
## Response: regTimeCoeff
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 1.00665 0.50332 7.5743 0.006594 **
## Residuals 13 0.86387 0.06645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With this initial exploratory analysis of the data we can conclude:
The following data is taken from Davis (2002). In the data, 40 male subjects were rated on the Brief Psychiatric Rating Scale (BPRS) after being treated with one of two treatments. A measurement was made at time point 0 and then weekly for 8 weeks. For more information on the BPRS go to this link.
First, we read the data.
bprs <- read.csv("bprs_long.csv") %>%
mutate(across(c(treatment, subject), factor))
And observe its structure and summarized values.
str(bprs)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ week : int 0 1 2 3 4 5 6 7 8 0 ...
## $ BPRS : int 42 36 36 43 41 40 38 47 51 58 ...
summary(bprs)
## treatment subject week BPRS
## 1:180 1 : 18 Min. :0 Min. :18.00
## 2:180 2 : 18 1st Qu.:2 1st Qu.:27.00
## 3 : 18 Median :4 Median :35.00
## 4 : 18 Mean :4 Mean :37.66
## 5 : 18 3rd Qu.:6 3rd Qu.:43.00
## 6 : 18 Max. :8 Max. :95.00
## (Other):252
We can see we have the long version of the data, which has previously been wrangled. The subject factor has 20 levels, but since the same labels are reused within each treatment, there were a total of 40 subjects, each scored on the BPRS 9 times. We can also see that both treatments have the same number of patients. Time is expressed in weeks. Since these measurements have been repeated on the same subjects over an extended period of time, we can label this as longitudinal data.
A great way to start looking at longitudinal data is to plot the response directly in a line plot, in this case BPRS vs. week. We will also separate the plotted lines by treatment.
bprs %>%
ggplot(aes(x = week, y = BPRS)) +
geom_line(aes(linetype = subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_wrap("treatment", labeller = label_both) +
theme_bw() +
theme(legend.position = "none")
At this point, we can notice that there seems to be a tendency for all of the subjects to get lower BPRS scores over time and to become less variable. Regardless, it is hard to conclude anything from such complex data.
If we look at the data set in its long form, it is easy to see that it can be studied with a multiple linear regression model. Here we ignore that the readings come from the same subjects. Let us fit a linear regression model.
bprs_reg <- lm(BPRS ~ week + treatment, data = bprs)
summary(bprs_reg)
##
## Call:
## lm(formula = BPRS ~ week + treatment, data = bprs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
We can clearly see that the second treatment is not statistically different from the first by looking at the p-value above 0.05. Additionally, the model only explains 18.5% of the variation in the BPRS measurements. However, this model assumes the independence of the measurements and is therefore incorrect. To overcome this issue, we will use linear mixed models.
First, we can fit a random intercept model with week and treatment as the variables affecting BPRS. This regression accounts for unobserved subject-level variables by allowing each subject's intercept to differ.
bprs_ref <- lmer(BPRS ~ week + treatment + (1 | subject), data = bprs, REML = F)
summary(bprs_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (1 | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
We can see that the random effect variance of the subjects is low compared to the residual variance. This means that there is not a lot of variation between the intercepts of the fitted regressions of the individual subjects. Furthermore, the fixed-effect estimates are identical to those of the regular linear model, and both the intercept and week estimates are significant, but not that of treatment 2. This supports the conclusion that the treatments do not differ in their effect on the BPRS.
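One way to quantify this is the intraclass correlation, a quick back-of-the-envelope check computed from the variance components copied from the summary above (not part of the original analysis):

```r
# Variance components from the random intercept model summary above:
var_subject  <- 47.41   # between-subject (intercept) variance
var_residual <- 104.21  # residual variance

# Intraclass correlation: share of the total variance due to subjects
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # -> 0.31
```

So roughly a third of the variation in BPRS is between subjects, which is also why accounting for subject is worthwhile at all.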
Now it is time to fit the random intercept and random slope model. This model allows both the slope and the intercept to vary between subjects, so it takes into consideration not only each subject's starting point but also their individual rate of change in BPRS over time. So let us calculate it.
bprs_ref1 <- lmer(BPRS ~ week + treatment + (week | subject), data = bprs, REML = F)
summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week + treatment + (week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
The fixed-effect estimates are identical to those of the regular linear model and the random intercept model, and both the intercept and week estimates are significant, but not that of treatment 2. This again supports the conclusion that the treatments do not differ in their effect on the BPRS.
We can use anova() to compare our two random-effects models.
anova(bprs_ref1,bprs_ref)
## Data: bprs
## Models:
## bprs_ref: BPRS ~ week + treatment + (1 | subject)
## bprs_ref1: BPRS ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_ref 5 2748.7 2768.1 -1369.4 2738.7
## bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the chi-squared test in the anova is significant, which confirms that the random intercept and random slope model fits our data better.
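The chi-squared statistic in that table is just a likelihood-ratio test, and can be reproduced by hand from the deviances printed in the two model summaries (values copied from above):

```r
# Deviances copied from the two lmer summaries above:
dev_intercept <- 2738.7  # random intercept only
dev_slope     <- 2731.4  # random intercept + random slope

# Likelihood-ratio statistic; the 2 extra parameters are the slope
# variance and the intercept-slope correlation:
chisq <- dev_intercept - dev_slope
pchisq(chisq, df = 2, lower.tail = FALSE)  # ~0.026
```

The small difference from the 0.02636 in the anova table comes from rounding in the printed deviances.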
Lastly, let us run our random intercept and random slope model, now adding a week × treatment interaction, to see if it affects the BPRS. We immediately compare it with the previous model.
bprs_ref2 <- lmer(BPRS ~ week * treatment + (week | subject), data = bprs, REML = F)
summary(bprs_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: BPRS ~ week * treatment + (week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0513 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0048 8.0626
## week 0.9687 0.9842 -0.51
## Residual 96.4702 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(bprs_ref2, bprs_ref1)
## Data: bprs
## Models:
## bprs_ref1: BPRS ~ week + treatment + (week | subject)
## bprs_ref2: BPRS ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4
## bprs_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At this point, the chi-squared p-value is not significant. This shows that the interaction model is not better, further supporting that there are no significantly different rates of change in BPRS between the treatments.
Now, let us compare the original data with the values fitted by the random intercept and random slope regression.
g1 <- bprs %>%
ggplot(aes(x = week, y = BPRS)) +
geom_line(aes(linetype = subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_wrap("treatment", labeller = label_both) +
theme_bw() +
theme(legend.position = "none")
Fitted <- fitted(bprs_ref1)
bprs <- bprs %>%
mutate(Fitted)
g2 <- bprs %>%
ggplot(aes(x = week, y = Fitted)) +
geom_line(aes(linetype = subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_wrap("treatment", labeller = label_both) +
theme_bw() +
theme(legend.position = "none")
ggarrange(g1, g2, ncol = 1)
The plots show very clearly that there is no discernible pattern differentiating one treatment from the other.